Fix goroutine leaks in plugin/sampling/strategystore/adaptive #5310

Contributor

@WillSewell WillSewell commented Mar 29, 2024

Which problem is this PR solving?

Description of the changes

How was this change tested?

  • make test lint

Checklist

@WillSewell WillSewell requested a review from a team as a code owner March 29, 2024 14:32
@WillSewell WillSewell requested a review from jkowall March 29, 2024 14:32
This mainly involved ensuring that all goroutines started by the
Processor are shut down in a Close method (which also blocks on
them returning via a WaitGroup).

Adding this flagged an issue where the `runUpdateProbabilitiesLoop`
had a long delay, so tests need to be able to override the default
Processor.followerRefreshInterval, or they take a long time to run.

Signed-off-by: Will Sewell <[email protected]>
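The shutdown pattern described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `processor` type, its fields, and `exitedAfterClose` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processor is a hypothetical stand-in for the adaptive sampling Processor.
type processor struct {
	shutdown chan struct{}
	wg       sync.WaitGroup
	mu       sync.Mutex
	exited   int // number of background loops that have returned
}

func newProcessor() *processor {
	return &processor{shutdown: make(chan struct{})}
}

// start launches n background loops, each tracked by the WaitGroup.
func (p *processor) start(n int) {
	for i := 0; i < n; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			p.loop()
		}()
	}
}

// loop does periodic work until the shutdown channel is closed.
func (p *processor) loop() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// periodic work would go here
		case <-p.shutdown:
			p.mu.Lock()
			p.exited++
			p.mu.Unlock()
			return
		}
	}
}

// Close signals shutdown and blocks until every loop has returned.
// In this sketch it must be called at most once.
func (p *processor) Close() error {
	close(p.shutdown)
	p.wg.Wait()
	return nil
}

// exitedAfterClose starts n loops, closes the processor, and reports
// how many loops had returned by the time Close came back.
func exitedAfterClose(n int) int {
	p := newProcessor()
	p.start(n)
	_ = p.Close()
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.exited
}

func main() {
	fmt.Println("loops exited:", exitedAfterClose(3))
}
```

Because `Close` waits on the WaitGroup, a caller can rely on no background goroutine from this processor still running once it returns.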
@WillSewell WillSewell force-pushed the fix-goroutine-leaks-in-sampling-strategystore-adaptive branch from 9440ab0 to 250ab88 Compare March 29, 2024 14:33
@yurishkuro
Member

Haven't looked at the code yet, but from the description it sounds like the exit is simply not implemented properly if you have to wait 20 seconds. If there is a loop with a timer, probably blocking in a select, a good approach is to add a separate "stop" channel that the Close function can close, which causes the select to exit.
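A rough sketch of the suggestion above, assuming a loop driven by a `time.Ticker`. The names `runLoop` and `stoppedPromptly` are made up for illustration, and the 20-second interval only mirrors the delay mentioned in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// runLoop does periodic work every interval until stop is closed.
// It closes done on exit so callers can observe that it returned.
func runLoop(interval time.Duration, stop <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// periodic work would go here
		case <-stop:
			return // closing stop unblocks the select immediately
		}
	}
}

// stoppedPromptly reports whether a loop with a 20-second interval
// exits within 2 seconds of its stop channel being closed.
func stoppedPromptly() bool {
	stop := make(chan struct{})
	done := make(chan struct{})
	go runLoop(20*time.Second, stop, done)
	close(stop)
	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("stopped promptly:", stoppedPromptly())
}
```

The loop never has to wait for the next tick to notice the stop request, so the tick interval no longer affects how long shutdown takes.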

@WillSewell
Contributor Author

Ah yes, that would be better. I'll have a go at reworking it.

Rather than having to set the "delay phase" to a low value, we
instead make it possible for the `shutdown` channel to unblock
the delay.

Signed-off-by: Will Sewell <[email protected]>
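One way to let a shutdown channel unblock a delay, as this commit message describes, is to replace a plain `time.Sleep` with a timer inside a `select`. The helper name `sleepOrShutdown` is hypothetical and the PR's actual code may differ.

```go
package main

import (
	"fmt"
	"time"
)

// sleepOrShutdown waits for d, or returns early if shutdown is closed.
// It returns true when the full delay elapsed and false when interrupted.
func sleepOrShutdown(d time.Duration, shutdown <-chan struct{}) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return true
	case <-shutdown:
		return false
	}
}

// demo returns the result of an interrupted hour-long delay and of an
// uninterrupted one-millisecond delay.
func demo() (interruptedResult, completedResult bool) {
	closed := make(chan struct{})
	close(closed)
	interruptedResult = sleepOrShutdown(time.Hour, closed)

	open := make(chan struct{})
	completedResult = sleepOrShutdown(time.Millisecond, open)
	return interruptedResult, completedResult
}

func main() {
	a, b := demo()
	fmt.Println(a, b) // false true
}
```

With this shape, tests do not need to shorten the delay: closing the channel ends the wait regardless of how long the delay is configured to be.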

codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 91.66667%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 95.12%. Comparing base (7c9dce4) to head (f456039).
Report is 3 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| cmd/collector/app/server/test.go | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5310      +/-   ##
==========================================
+ Coverage   95.06%   95.12%   +0.05%     
==========================================
  Files         340      340              
  Lines       16612    16640      +28     
==========================================
+ Hits        15792    15828      +36     
+ Misses        631      624       -7     
+ Partials      189      188       -1     
| Flag | Coverage Δ |
|---|---|
| badger | 13.26% <ø> (ø) |
| cassandra-3.x | 26.44% <ø> (ø) |
| cassandra-4.x | 26.44% <ø> (ø) |
| elasticsearch-5.x | 21.70% <ø> (+0.01%) ⬆️ |
| elasticsearch-6.x | 21.70% <ø> (ø) |
| elasticsearch-7.x | 21.77% <ø> (-0.02%) ⬇️ |
| elasticsearch-8.x | 21.85% <ø> (ø) |
| grpc | 10.95% <ø> (-0.05%) ⬇️ |
| kafka | 14.73% <ø> (ø) |
| opensearch-1.x | 21.77% <ø> (-0.02%) ⬇️ |
| opensearch-2.x | 21.78% <ø> (+0.01%) ⬆️ |
| unittests | 92.29% <91.66%> (+0.04%) ⬆️ |


}()
defer func() {
	close(p.shutdown)
	<-done
Member

Do we really need the done sync here? As long as the goroutines are not blocked, they will exit and goleak will tolerate that; it does not expect an immediately clean state when it's called.

Contributor Author

Ah, thanks for flagging this. It is not required (the tests pass without it). I was also misunderstanding the goleak semantics.

I think the issue I was having is that, before dfe7adc, these tests were failing because a goroutine could be blocked in time.Sleep.

I'll remove this.
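The distinction discussed in this thread, between a goroutine that is still on its way out and one that is genuinely blocked, can be illustrated with a standard-library stand-in. This is not goleak's implementation or API; it is only a sketch that polls `runtime.NumGoroutine` a few times before giving up, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// settled polls until the goroutine count is back at or below baseline,
// or until the attempts run out. Hypothetical helper, not part of goleak.
func settled(baseline, attempts int, pause time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if runtime.NumGoroutine() <= baseline {
			return true
		}
		time.Sleep(pause)
	}
	return false
}

// exitingGoroutineIsTolerated starts a goroutine that is unblocked but
// slow to return, and reports whether the retrying check accepts it.
func exitingGoroutineIsTolerated() bool {
	baseline := runtime.NumGoroutine()
	stop := make(chan struct{})
	go func() {
		<-stop
		time.Sleep(5 * time.Millisecond) // simulate a slow but unblocked exit
	}()
	close(stop)
	return settled(baseline, 50, 10*time.Millisecond)
}

// blockedGoroutineIsReported starts a goroutine that stays blocked for the
// duration of the check, and reports whether the check flags it.
func blockedGoroutineIsReported() bool {
	baseline := runtime.NumGoroutine()
	block := make(chan struct{})
	go func() { <-block }()
	leaked := !settled(baseline, 5, 10*time.Millisecond)
	close(block) // release the goroutine so the demo itself does not leak
	return leaked
}

func main() {
	fmt.Println("exiting goroutine tolerated:", exitingGoroutineIsTolerated())
	fmt.Println("blocked goroutine reported:", blockedGoroutineIsReported())
}
```

Under this model, a goroutine stuck in a long `time.Sleep` behaves like the blocked case, which matches the failure described above.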

@@ -24,6 +24,9 @@ import (

// StrategyStore keeps track of service specific sampling strategies.
type StrategyStore interface {
	// Close() from io.Closer stops the processor from calculating probabilities.
	io.Closer
Member

You added this to the interface, but I am not seeing any non-test code that's actually calling it.

Adding Closer to an interface is often contentious; some people argue that if you create an object via NewX() *X, you already have the ability to call Close on it without adding a Close function to the interface that X implements. This doesn't work well when factories are involved since the factory does return an interface, not an actual struct. One other workaround to that is doing a runtime check for io.Closer interface and only then calling close - this is why I am asking about prod code calling it.

I'm OK with keeping io.Closer in the interface because both real implementations are now closable (the static store did not have Close before we added the file watcher to it).
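The runtime-check workaround mentioned above could look roughly like this. The `strategyStore` interface and the two store types here are simplified, hypothetical stand-ins, not the Jaeger types.

```go
package main

import (
	"fmt"
	"io"
)

// strategyStore is a hypothetical, trimmed-down interface without Close.
type strategyStore interface {
	Name() string
}

// staticStore has nothing to release, so it does not implement io.Closer.
type staticStore struct{}

func (staticStore) Name() string { return "static" }

// adaptiveStore owns background work and therefore implements io.Closer.
type adaptiveStore struct{ closed bool }

func (s *adaptiveStore) Name() string { return "adaptive" }
func (s *adaptiveStore) Close() error { s.closed = true; return nil }

// closeIfCloser calls Close only when the concrete value behind the
// interface also implements io.Closer. It reports whether Close was called.
func closeIfCloser(s strategyStore) (bool, error) {
	if c, ok := s.(io.Closer); ok {
		return true, c.Close()
	}
	return false, nil
}

func main() {
	for _, s := range []strategyStore{staticStore{}, &adaptiveStore{}} {
		called, err := closeIfCloser(s)
		fmt.Println(s.Name(), called, err)
	}
}
```

The trade-off is that the check happens at run time, so a store that gains a `Close` method later is picked up automatically, while one that should have it but does not is silently skipped.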

Contributor Author

Yes I think that makes sense.

> This doesn't work well when factories are involved since the factory does return an interface, not an actual struct.

Is there a fundamental reason why factories shouldn't return a struct instead of an interface? (Other than it being a breaking change to make in this instance).

> One other workaround to that is doing a runtime check for io.Closer interface and only then calling close - this is why I am asking about prod code calling it.

Prod code is not calling Close - do you have a preference between the current implementation vs the runtime check in tests? I don't feel strongly.

Member

> Is there a fundamental reason why factories shouldn't return a struct instead of an interface?

Yes - polymorphism. The whole point of a factory is to abstract what underlying implementation it creates, which means it always returns an interface.

> Prod code is not calling Close

I actually think our pattern is that the main code only calls factory.Close() and the factory is generally responsible for releasing any resources. E.g. we don't call Close on SpanReader that we obtain from the factory.
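The ownership pattern described here, where callers close only the factory and the factory releases whatever it created, might be sketched as below. All names (`factory`, `spanReader`, `reader`, `release`) are hypothetical and simplified; they are not the Jaeger storage types.

```go
package main

import (
	"fmt"
	"sync"
)

// spanReader is a hypothetical interface handed out by the factory.
// It deliberately has no Close method.
type spanReader interface {
	Read() string
}

type reader struct {
	mu     sync.Mutex
	closed bool
}

func (r *reader) Read() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return "closed"
	}
	return "span"
}

// release frees the reader's resources; only the factory calls it.
func (r *reader) release() {
	r.mu.Lock()
	r.closed = true
	r.mu.Unlock()
}

// factory remembers everything it creates so it can release it later.
type factory struct {
	mu      sync.Mutex
	readers []*reader
}

// CreateSpanReader returns the interface, hiding the concrete type.
func (f *factory) CreateSpanReader() spanReader {
	f.mu.Lock()
	defer f.mu.Unlock()
	r := &reader{}
	f.readers = append(f.readers, r)
	return r
}

// Close releases every reader the factory has handed out.
func (f *factory) Close() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, r := range f.readers {
		r.release()
	}
	f.readers = nil
	return nil
}

func main() {
	f := &factory{}
	r := f.CreateSpanReader()
	fmt.Println(r.Read()) // span
	_ = f.Close()
	fmt.Println(r.Read()) // closed
}
```

With this arrangement the returned interface can stay free of `Close`, at the cost of the factory having to track what it creates.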

@yurishkuro yurishkuro added the changelog:test Change that's adding missing tests or correcting existing tests label Mar 29, 2024
@yurishkuro yurishkuro merged commit bae96f7 into jaegertracing:main Mar 29, 2024
36 of 38 checks passed
@WillSewell WillSewell deleted the fix-goroutine-leaks-in-sampling-strategystore-adaptive branch March 29, 2024 21:09